{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### <center> FPGA-ACCELERATION OF MACHINE LEARNING APPLICATIONS USING APACHE SPARK <br/> A USE CASE SCENARIO ON LOGISTIC REGRESSION <center/>\n",
    "\n",
    "# <center> Classifying Handwritten Digits <br/> with <br/> Logistic Regression </center>\n",
    "\n",
    "<img style=\"float: center; width: 450px; height: 100px;\" src=\"sample.png\">\n",
    "\n",
    "## Introduction\n",
    "In this notebook an interactive PySpark shell is loaded and our Logistic Regression application is executed, using our accelerated ML library. The accelerated ML library is written in Python. It supports standard learning algorithms, including common settings like classification, regression etc. We are given the option to choose between an accelerated execution that uses both software and hardware and a non-accelerated one, that uses only the CPU cores. Upon choosing the accelerated option, the accelerator's library is invoked (which is also written in Python) where the input data is stored in memory mapped buffers and are then transfered and processed in the PL. The whole communication with the PL is achieved using an AXI4-Stream Accelerator Adapter.\n",
    "\n",
    "___\n",
    "\n",
    "## 1. Data Sets\n",
    "\n",
    "The data are taken from the famous <a href=\"http://yann.lecun.com/exdb/mnist/\">MNIST</a> dataset. \n",
    "\n",
    "The original data file contains gray-scale images of hand-drawn digits, from zero through nine. Each image is 28 pixels in height and 28 pixels in width, for a total of 784 pixels in total. Each pixel has a single pixel-value associated with it, indicating the lightness or darkness of that pixel, with higher numbers meaning darker. This pixel-value is an integer between 0 and 255, inclusive.\n",
    "\n",
    "In this example the data we use are already preprocessed/normalized using Feature Standardization method (<a href=\"https://en.wikipedia.org/wiki/Standard_score\">Z-score scaling</a>).\n",
    "\n",
    "The (*train* and *test*) data sets that are used below have 785 columns. The first column, called \"label\", is the digit that was drawn by the user. The rest of the columns contain the (rescaled) pixel-values of the associated image."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. PySpark initialization\n",
    "\n",
    "In this section we initialize PySpark to predefine the SparkContext variable. \\$SPARK_HOME and other needed environment variables are set under the ***/etc/environment*** file. \n",
    "\n",
    "> Make sure you have correctly set all needed paths and variables and that Py4J matches the version you have installed."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Welcome to\n",
      "      ____              __\n",
      "     / __/__  ___ _____/ /__\n",
      "    _\\ \\/ _ \\/ _ `/ __/  '_/\n",
      "   /__ / .__/\\_,_/_/ /_/\\_\\   version 2.1.1-SNAPSHOT\n",
      "      /_/\n",
      "\n",
      "Using Python version 3.4.3+ (default, Oct 14 2015 16:03:50)\n",
      "SparkSession available as 'spark'.\n"
     ]
    }
   ],
   "source": [
    "import sys, os\n",
    "\n",
    "spark_home = os.environ.get(\"SPARK_HOME\", None)\n",
    "\n",
    "# Add the spark python sub-directory to the path\n",
    "sys.path.insert(0, spark_home + \"/python\")\n",
    "\n",
    "# Add the py4j to the path.\n",
    "# You may need to change the version number to match your install\n",
    "sys.path.insert(0, os.path.join(spark_home + \"/python/lib/py4j-0.10.4-src.zip\"))\n",
    "\n",
    "# Initialize PySpark to predefine the SparkContext variable 'sc'\n",
    "filename = spark_home+\"/python/pyspark/shell.py\"\n",
    "exec(open(filename).read())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. Logistic Regression Application\n",
    "\n",
    "This example shows how our accelerated Logistic Regression library is called to train a LR model on the train set and then test its accuracy. If accel is set (***accel = 1***), the hardware accelerator is used for the computation of the gradients in each iteration.\n",
    "\n",
    "### Read data & parameters\n",
    "\n",
    "The size of the train set, as well as the number of the iterations are intentionally picked small, to avoid large execution time in SW-only cases."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": false,
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "* LogisticRegression Application *\n",
      " # train file:               data/MNIST_train.dat\n",
      " # test file:                data/MNIST_test.dat\n"
     ]
    }
   ],
   "source": [
    "chunkSize = 4000\n",
    "alpha = 0.25\n",
    "iterations = 5\n",
    "\n",
    "train_file = \"data/MNIST_train.dat\"\n",
    "test_file = \"data/MNIST_test.dat\"\n",
    "\n",
    "sc.appName = \"Python Logistic Regression\"\n",
    "\n",
    "print(\"* LogisticRegression Application *\")\n",
    "print(\" # train file:               \" + train_file)\n",
    "print(\" # test file:                \" + test_file)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### HW accelerated vs SW-only"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Select mode (0: SW-only, 1: HW accelerated) : 1\n"
     ]
    }
   ],
   "source": [
    "accel = int(input(\"Select mode (0: SW-only, 1: HW accelerated) : \"))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Instantiate a Logistic Regression model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from pyspark.mllib_accel.classification import LogisticRegression\n",
    "\n",
    "trainRDD = sc.textFile(train_file).coalesce(1)\n",
    "\n",
    "numClasses = 10\n",
    "numFeatures = 784 \n",
    "LR = LogisticRegression(numClasses, numFeatures)    "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Train the LR model "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": false,
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "    * LogisticRegression Training *\n",
      "     # numSamples:               4000\n",
      "     # chunkSize:                4000\n",
      "     # numClasses:               10\n",
      "     # numFeatures:              784\n",
      "     # alpha:                    0.25\n",
      "     # iterations:               5\n",
      "100% |▥▥▥▥▥▥▥▥▥▥▥▥▥▥▥▥▥▥▥▥▥▥▥▥▥▥▥▥▥▥▥▥▥▥▥▥▥▥▥▥| 5/5 Time: 96.310 sec\n"
     ]
    }
   ],
   "source": [
    "weights = LR.train(trainRDD, chunkSize, alpha, iterations, accel)\n",
    "    \n",
    "with open(\"data/weights.out\", \"w\") as weights_file:\n",
    "    for k in range(0, numClasses):\n",
    "        for j in range(0, numFeatures):\n",
    "            if j == 0:\n",
    "                weights_file.write(str(round(weights[k * numFeatures + j], 5)))\n",
    "            else:\n",
    "                weights_file.write(\",\" + str(round(weights[k * numFeatures + j], 5)))\n",
    "        weights_file.write(\"\\n\")\n",
    "weights_file.close()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Test the LR model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "    * LogisticRegression Testing *\n",
      "     # accuracy:                 0.82(1640/2000)\n",
      "     # true:                     1640\n",
      "     # false:                    360\n"
     ]
    }
   ],
   "source": [
    "testRDD = sc.textFile(test_file)\n",
    "\n",
    "LR.test(testRDD)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "## 4. Performance metrics\n",
    "\n",
    "\n",
    "` Execution time for different execution scenarios:`\n",
    "\n",
    "Target | Time\n",
    ":--- | ---:\n",
    "`PYNQ SW-only:` | `1483.859 sec` \n",
    "`PYNQ HW accelerated:` | `96.310 sec`"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.4.3+"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
